We hope to explore the relative influence of physical traits, environmental conditions, and species identity on the growth rate of trees. A gradient boosted model (GBM) is a good candidate for this work since it handles non-linear relationships and interactions among many predictors without strong distributional assumptions.
We first converted the environmental variables to principal components, as they were highly correlated. We visualized the PCA and used the eigenvectors to determine which environmental condition best explained each PC. The resulting axes corresponded to Soil.Fertility, Light, Temperature, pH, Soil.Humidity.Depth, and Slope.
We also want to ensure that the plant traits are not correlated. Past work suggests that they are not well represented by a PCA, so we do not use this feature-reduction method for them.
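The PCA step can be sketched as follows; `env_vars` and its columns are hypothetical stand-ins for the environmental measurements, not the analysis objects:

```r
# PCA on correlated environmental variables (sketch; `env_vars` is a
# placeholder for the data frame of raw environmental measurements)
set.seed(123)
env_vars <- data.frame(
  soil_n = rnorm(50, 10, 2),
  soil_p = rnorm(50, 5, 1),
  canopy = runif(50, 0, 1)
)
env_pca <- prcomp(env_vars, center = TRUE, scale. = TRUE)

# Eigenvectors (loadings): the variable with the largest absolute
# loading on each PC suggests a label for that component
round(env_pca$rotation, 2)

# Proportion of variance explained by each component
summary(env_pca)
```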
A gradient boosted machine (GBM) is a machine learning model that fits the data with an ensemble of decision trees.
A decision tree starts with all of the observations and then, from the variables provided, finds the split that produces the "purest" groupings of the data. In this case, it tries to place rows with higher growth rates in one node and rows with lower growth rates in another node.
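A single split of this kind can be illustrated with a depth-1 `rpart` tree (a "stump") on made-up data:

```r
library(rpart)

# Toy data: growth rate depends mostly on light (illustrative only)
set.seed(42)
d <- data.frame(light = runif(200))
d$growth <- ifelse(d$light > 0.5, 2, 1) + rnorm(200, sd = 0.1)

# A depth-1 tree finds the single split that best separates
# high-growth rows from low-growth rows
stump <- rpart(growth ~ light, data = d, maxdepth = 1)
stump  # the chosen split should fall near light = 0.5
```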
GBMs are an ensemble of decision trees, but the trees are fit
sequentially. We call GBMs an ensemble of weak learners because each
subsequent tree attempts to correct the errors of the previous
trees. Thus, while one tree by itself cannot describe the
relationships, the full ensemble can. Below is a figure
by Bradley Boehmke that illustrates how each subsequent tree
improves the fit to the data.
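The sequential error-correcting idea can be shown with a minimal hand-rolled boosting loop of stumps fit to residuals (toy data; this sketches the concept, not the analysis code):

```r
library(rpart)

# Toy data with a non-linear signal
set.seed(1)
d <- data.frame(x = runif(300))
d$y <- sin(2 * pi * d$x) + rnorm(300, sd = 0.2)

learn_rate <- 0.1
pred <- rep(mean(d$y), nrow(d))  # start from the mean prediction
rmse <- numeric(50)

for (m in 1:50) {
  d$resid <- d$y - pred                 # errors of the ensemble so far
  stump <- rpart(resid ~ x, data = d, maxdepth = 2)
  pred <- pred + learn_rate * predict(stump, d)  # small corrective step
  rmse[m] <- sqrt(mean((d$y - pred)^2))
}

# Training RMSE should fall as corrective trees are added
head(rmse); tail(rmse)
```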
We compared the fit of three gradient boosted models to determine how environmental gradients and physical traits influence RGR:
Though we present outputs for all three models below, we show that
the best model is Model 1, using caret::resamples. This
function lets us iteratively build models on the training data
and measure performance each round. In the end, we have resampled
measures of each performance metric: R-squared, RMSE, and MAE. Before
comparing model performance, however, we first tune the models; this
helps determine the range of parameters that best fit the data before
running caret::resamples.
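In outline, the comparison looks like the following sketch; the models, data, and object names here are illustrative stand-ins, not the analysis objects:

```r
library(caret)

# Toy data and two competing model specifications (illustrative)
set.seed(123)
d <- data.frame(x1 = runif(100), x2 = runif(100))
d$y <- 2 * d$x1 + rnorm(100, sd = 0.3)

ctrl <- trainControl(method = "cv", number = 5)
m1 <- train(y ~ x1,      data = d, method = "lm", trControl = ctrl)
m2 <- train(y ~ x1 + x2, data = d, method = "lm", trControl = ctrl)

# Pool the resampled RMSE / R-squared / MAE for side-by-side comparison
comp <- resamples(list("Model 1" = m1, "Model 2" = m2))
summary(comp)
```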
Below, we show the best parameters for the models, given the data.

### Model 1: Tree Age + Plant Traits + Environmental Conditions {.tabset}
## $model_id
## [1] "final_grid_model_148"
##
## $training_frame
## [1] "train.hex"
##
## $validation_frame
## [1] "valid.hex"
##
## $score_tree_interval
## [1] 10
##
## $ntrees
## [1] 10000
##
## $max_depth
## [1] 3
##
## $min_rows
## [1] 4
##
## $nbins
## [1] 32
##
## $nbins_cats
## [1] 256
##
## $stopping_rounds
## [1] 5
##
## $stopping_metric
## [1] "deviance"
##
## $stopping_tolerance
## [1] 1e-04
##
## $max_runtime_secs
## [1] 3574.007
##
## $seed
## [1] 1234
##
## $learn_rate
## [1] 0.05
##
## $distribution
## [1] "gaussian"
##
## $sample_rate
## [1] 0.28
##
## $col_sample_rate
## [1] 0.55
##
## $col_sample_rate_per_tree
## [1] 0.27
##
## $min_split_improvement
## [1] 0
##
## $histogram_type
## [1] "RoundRobin"
##
## $categorical_encoding
## [1] "Enum"
##
## $calibration_method
## [1] "PlattScaling"
##
## $x
## [1] "Soil.Fertility" "Light" "Temperature" "pH"
## [5] "Slope" "Estem" "Branching.Distance" "Stem.Wood.Density"
## [9] "Leaf.Area" "LMA" "LCC" "LNC"
## [13] "LPC" "d15N" "t.b2" "Ks"
## [17] "Ktwig" "Huber.Value" "X.Lum" "VD"
## [21] "X.Sapwood" "d13C" "Tree.Age" "julian.date.2011"
##
## $y
## [1] "BAI_GR"
### Model 2: Species Identity + Environmental Conditions

## $model_id
## [1] "final_grid_model_12"
##
## $training_frame
## [1] "train.hex"
##
## $validation_frame
## [1] "valid.hex"
##
## $score_tree_interval
## [1] 10
##
## $ntrees
## [1] 10000
##
## $max_depth
## [1] 11
##
## $min_rows
## [1] 4
##
## $nbins
## [1] 256
##
## $nbins_cats
## [1] 4096
##
## $stopping_rounds
## [1] 5
##
## $stopping_metric
## [1] "deviance"
##
## $stopping_tolerance
## [1] 1e-04
##
## $max_runtime_secs
## [1] 3594.991
##
## $seed
## [1] 1234
##
## $learn_rate
## [1] 0.05
##
## $distribution
## [1] "gaussian"
##
## $sample_rate
## [1] 0.82
##
## $col_sample_rate
## [1] 0.7
##
## $col_sample_rate_per_tree
## [1] 0.92
##
## $min_split_improvement
## [1] 0
##
## $histogram_type
## [1] "QuantilesGlobal"
##
## $categorical_encoding
## [1] "Enum"
##
## $calibration_method
## [1] "PlattScaling"
##
## $x
## [1] "Soil.Fertility" "Light" "Temperature" "pH" "Slope"
## [6] "Species" "Tree.Age" "julian.date.2011"
##
## $y
## [1] "BAI_GR"
### Model 3: Species Identity + Environmental Conditions + Plant Traits

## $model_id
## [1] "final_grid_model_77"
##
## $training_frame
## [1] "train.hex"
##
## $validation_frame
## [1] "valid.hex"
##
## $score_tree_interval
## [1] 10
##
## $ntrees
## [1] 10000
##
## $max_depth
## [1] 6
##
## $min_rows
## [1] 2
##
## $nbins
## [1] 32
##
## $nbins_cats
## [1] 4096
##
## $stopping_rounds
## [1] 5
##
## $stopping_metric
## [1] "deviance"
##
## $stopping_tolerance
## [1] 1e-04
##
## $max_runtime_secs
## [1] 3547.872
##
## $seed
## [1] 1234
##
## $learn_rate
## [1] 0.05
##
## $distribution
## [1] "gaussian"
##
## $sample_rate
## [1] 0.2
##
## $col_sample_rate
## [1] 0.7
##
## $col_sample_rate_per_tree
## [1] 0.85
##
## $min_split_improvement
## [1] 1e-08
##
## $histogram_type
## [1] "RoundRobin"
##
## $categorical_encoding
## [1] "Enum"
##
## $calibration_method
## [1] "PlattScaling"
##
## $x
## [1] "Soil.Fertility" "Light" "Temperature" "pH"
## [5] "Slope" "Estem" "Branching.Distance" "Stem.Wood.Density"
## [9] "Leaf.Area" "LMA" "LCC" "LNC"
## [13] "LPC" "d15N" "t.b2" "Ks"
## [17] "Ktwig" "Huber.Value" "X.Lum" "VD"
## [21] "X.Sapwood" "d13C" "Species" "Tree.Age"
## [25] "julian.date.2011"
##
## $y
## [1] "BAI_GR"
Here, we present results from model comparisons and show that the best model is Model 1 - Plant Traits + Environmental Conditions.
##
## Call:
## summary.resamples(object = ModelPerformanceCompare)
##
## Models: Model 1 - Plant Traits + Environmental Conditions, Model 2 - Species Identity + Environmental Conditions, Model 3 - Species Identity + Environmental Conditions + Plant Traits
## Number of resamples: 25
##
## MAE
## Min. 1st Qu. Median
## Model 1 - Plant Traits + Environmental Conditions 0.6207895 0.6957431 0.7265563
## Model 2 - Species Identity + Environmental Conditions 0.7598406 0.8467461 0.8879756
## Model 3 - Species Identity + Environmental Conditions + Plant Traits 0.7008084 0.7904407 0.8166411
## Mean 3rd Qu. Max. NA's
## Model 1 - Plant Traits + Environmental Conditions 0.7408249 0.7904460 0.8598637 0
## Model 2 - Species Identity + Environmental Conditions 0.8964071 0.9432935 1.0775755 0
## Model 3 - Species Identity + Environmental Conditions + Plant Traits 0.8183526 0.8591769 0.9161170 0
##
## RMSE
## Min. 1st Qu. Median
## Model 1 - Plant Traits + Environmental Conditions 0.8024465 0.9154721 1.014256
## Model 2 - Species Identity + Environmental Conditions 1.0508989 1.1441543 1.214355
## Model 3 - Species Identity + Environmental Conditions + Plant Traits 0.9273804 1.0402613 1.092019
## Mean 3rd Qu. Max. NA's
## Model 1 - Plant Traits + Environmental Conditions 1.012689 1.061483 1.250678 0
## Model 2 - Species Identity + Environmental Conditions 1.214074 1.258653 1.419486 0
## Model 3 - Species Identity + Environmental Conditions + Plant Traits 1.096026 1.125715 1.333776 0
##
## Rsquared
## Min. 1st Qu. Median
## Model 1 - Plant Traits + Environmental Conditions 2.158172e-02 0.093931513 0.13196451
## Model 2 - Species Identity + Environmental Conditions 1.349711e-05 0.002559265 0.01320337
## Model 3 - Species Identity + Environmental Conditions + Plant Traits 4.107481e-04 0.021497133 0.07528740
## Mean 3rd Qu. Max.
## Model 1 - Plant Traits + Environmental Conditions 0.13189257 0.17315454 0.2605188
## Model 2 - Species Identity + Environmental Conditions 0.02671596 0.02832347 0.1401977
## Model 3 - Species Identity + Environmental Conditions + Plant Traits 0.07068428 0.10458316 0.1884573
## NA's
## Model 1 - Plant Traits + Environmental Conditions 0
## Model 2 - Species Identity + Environmental Conditions 0
## Model 3 - Species Identity + Environmental Conditions + Plant Traits 0
Now, we can build the model.
library(gbm)
library(dplyr)

set.seed(123)
gbm_regressor_bai_residuals <-
  gbm(BAI_GR ~ .,
      data =
        rgr_msh_na %>%
        filter(Group == "Train") %>%
        filter(!is.na(BAI_GR)) %>%
        select(any_of(c(EnvironmentalVariablesKeep,
                        PlantTraitsKeep, "Tree.Age", "BAI_GR",
                        "julian.date.2011"))),
      n.trees = 1000,
      interaction.depth = 3,  # max tree depth
      shrinkage = 0.05,       # learning rate
      n.minobsinnode = 13,    # min. observations per terminal node
      bag.fraction = 0.28,    # row sampling rate per tree
      verbose = FALSE,
      n.cores = NULL,
      cv.folds = 5)

First, we look at the importance of variables in the model.
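With `gbm`, each predictor's relative influence comes from `summary()` on the fitted model. A self-contained sketch with toy data (all names here are illustrative; in the analysis the fitted object is `gbm_regressor_bai_residuals`):

```r
library(gbm)

# Toy data: x1 matters most, x2 a little, noise not at all
set.seed(123)
d <- data.frame(x1 = runif(200), x2 = runif(200), noise = runif(200))
d$y <- 3 * d$x1 + d$x2 + rnorm(200, sd = 0.2)

fit <- gbm(y ~ ., data = d, distribution = "gaussian",
           n.trees = 200, interaction.depth = 2, shrinkage = 0.05,
           verbose = FALSE)

# Relative influence of each predictor, summing to 100
imp <- summary(fit, n.trees = 200, plotit = FALSE)
imp
```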
Next, we assess the relationship between growth rate and each predictor while holding everything else constant.
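One way to do this is with partial dependence, which `gbm`'s `plot` method computes by averaging the other predictors out. A minimal self-contained sketch (the data and variable names here are made up; in the real analysis the fitted object is `gbm_regressor_bai_residuals`):

```r
library(gbm)

# Toy data: growth rises with log(age) and with light (illustrative)
set.seed(1)
d <- data.frame(age = runif(300, 1, 100), light = runif(300))
d$growth <- log(d$age) + 2 * d$light + rnorm(300, sd = 0.2)

fit <- gbm(growth ~ age + light, data = d, distribution = "gaussian",
           n.trees = 300, interaction.depth = 2, shrinkage = 0.05,
           verbose = FALSE)

# Partial dependence of growth on age, with light averaged out;
# return.grid = TRUE returns the values instead of drawing a plot
pd <- plot(fit, i.var = "age", n.trees = 300, return.grid = TRUE)
head(pd)
```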
Let’s explore the interactions in these data.
##
## Kruskal-Wallis rank sum test
##
## data: Value by Class
## Kruskal-Wallis chi-squared = 22.488, df = 2, p-value = 1.309e-05
## Comparison Z
## 1 Environmental Conditions:Environmental Conditions - Plant Traits:Environmental Conditions -3.422308
## 2 Environmental Conditions:Environmental Conditions - Plant Traits:Plant Traits -4.679483
## 3 Plant Traits:Environmental Conditions - Plant Traits:Plant Traits -1.405564
## P.unadj P.adj
## 1 6.209183e-04 1.241837e-03
## 2 2.875992e-06 8.627977e-06
## 3 1.598537e-01 1.598537e-01
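The interaction values compared above are Friedman's H-statistics, which `gbm::interact.gbm` computes for a pair of predictors. A self-contained sketch on toy data (names illustrative):

```r
library(gbm)

# Toy data with a genuine a:b interaction and an inert variable c
set.seed(7)
d <- data.frame(a = runif(400), b = runif(400), c = runif(400))
d$y <- d$a * d$b + rnorm(400, sd = 0.1)

fit <- gbm(y ~ ., data = d, distribution = "gaussian",
           n.trees = 300, interaction.depth = 3, shrinkage = 0.05,
           verbose = FALSE)

# Friedman's H-statistic for each pair; values near 0 mean little
# interaction, and pairs above a cutoff (e.g. 0.10) are worth plotting
h_ab <- interact.gbm(fit, data = d, i.var = c("a", "b"), n.trees = 300)
h_ac <- interact.gbm(fit, data = d, i.var = c("a", "c"), n.trees = 300)
c(a_b = h_ab, a_c = h_ac)
```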
Now, we plot the interactions with interaction-strength values > 0.10.
Finally, we compare the relative importance of the various groups - tree age, plant traits, and environmental conditions.
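One way to make this group-level comparison, assuming importance output like `summary()`'s and a hypothetical group mapping, is to sum relative influence within each group (all values below are made up for illustration):

```r
# Aggregate gbm relative influence by predictor group (sketch; the
# importance values and group mapping here are illustrative)
imp <- data.frame(
  var     = c("Tree.Age", "LMA", "d13C", "Light", "Temperature"),
  rel.inf = c(20, 25, 15, 30, 10)
)
group_lookup <- c(Tree.Age    = "Tree Age",
                  LMA         = "Plant Traits",
                  d13C        = "Plant Traits",
                  Light       = "Environmental Conditions",
                  Temperature = "Environmental Conditions")

imp$group <- group_lookup[imp$var]
group_importance <- aggregate(rel.inf ~ group, data = imp, FUN = sum)
group_importance
```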
The model below is a GLM with the main effects and the top 22 interactions from the best model. It will serve as a baseline for comparison against other models Julie found in the literature. We caution that this is an exploratory model and not a true translation of the GBM, since it does not model the non-linearity.
##
## Call:
## glm(formula = TestModelFormula, data = rgr_msh_na)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9145 -0.5267 -0.1461 0.3479 3.4778
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.958e+00 3.367e+00 1.770 0.07822 .
## Soil.Fertility 7.276e-01 1.400e+00 0.520 0.60372
## Light 1.675e+00 1.492e+00 1.123 0.26266
## Temperature 7.468e-02 9.302e-02 0.803 0.42293
## pH -1.725e-01 1.536e-01 -1.123 0.26277
## Slope -9.130e-03 1.553e-01 -0.059 0.95318
## Estem 1.961e-05 2.406e-05 0.815 0.41592
## Branching.Distance -2.833e-02 1.644e-02 -1.724 0.08620 .
## Stem.Wood.Density 5.656e-01 9.061e-01 0.624 0.53318
## Leaf.Area 6.877e-04 4.916e-03 0.140 0.88888
## LMA 1.925e-02 5.804e-03 3.316 0.00107 **
## LCC -3.369e-02 3.063e-02 -1.100 0.27266
## LNC -9.213e-03 2.649e-01 -0.035 0.97229
## LPC -1.404e-02 2.927e-02 -0.480 0.63206
## d15N -1.363e-01 1.540e-01 -0.885 0.37729
## t.b2 -1.047e+01 1.557e+01 -0.673 0.50171
## Ks -1.394e-02 1.288e-02 -1.082 0.28041
## Ktwig 1.488e-04 5.997e-04 0.248 0.80431
## Huber.Value 2.936e+00 5.575e+00 0.527 0.59898
## X.Lum -4.015e-01 7.229e+00 -0.056 0.95576
## VD 1.626e-04 7.483e-05 2.172 0.03094 *
## X.Sapwood -4.738e-01 1.180e+00 -0.401 0.68857
## d13C 1.420e-01 6.873e-02 2.066 0.04003 *
## Tree.Age -4.475e-03 1.027e-02 -0.436 0.66335
## julian.date.2011 3.263e-03 8.494e-03 0.384 0.70127
## LPC:Ktwig -8.664e-07 2.904e-05 -0.030 0.97623
## LNC:Ks 3.844e-03 2.099e-03 1.831 0.06843 .
## Huber.Value:Tree.Age -3.513e-03 1.821e-01 -0.019 0.98463
## Soil.Fertility:Stem.Wood.Density 8.891e-01 5.171e-01 1.719 0.08698 .
## d15N:Huber.Value -1.041e+00 7.993e-01 -1.302 0.19419
## VD:X.Sapwood 6.428e-05 5.786e-05 1.111 0.26787
## Ks:julian.date.2011 2.216e-06 6.299e-05 0.035 0.97196
## Slope:VD 1.377e-05 7.893e-06 1.744 0.08254 .
## Light:d13C 5.226e-02 4.916e-02 1.063 0.28901
## X.Lum:X.Sapwood -3.803e+00 8.882e+00 -0.428 0.66894
## Soil.Fertility:Huber.Value 1.196e+00 1.534e+00 0.779 0.43657
## Soil.Fertility:X.Sapwood -3.235e-03 3.966e-01 -0.008 0.99350
## Soil.Fertility:LCC -2.948e-02 2.934e-02 -1.005 0.31624
## Branching.Distance:LNC 1.390e-02 5.980e-03 2.325 0.02102 *
## LNC:VD -6.617e-05 2.835e-05 -2.334 0.02052 *
## Leaf.Area:X.Sapwood 1.531e-03 7.118e-03 0.215 0.82990
## Stem.Wood.Density:d15N 3.758e-01 2.250e-01 1.670 0.09630 .
## Leaf.Area:Tree.Age -1.297e-04 1.262e-04 -1.028 0.30512
## Slope:Tree.Age -3.017e-03 6.087e-03 -0.496 0.62063
## t.b2:Ktwig 1.021e-02 1.334e-02 0.765 0.44510
## pH:Ktwig -4.869e-05 1.024e-04 -0.475 0.63497
## d15N:Ktwig -2.259e-05 4.167e-05 -0.542 0.58827
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.8795727)
##
## Null deviance: 283.97 on 259 degrees of freedom
## Residual deviance: 187.35 on 213 degrees of freedom
## (2 observations deleted due to missingness)
## AIC: 748.64
##
## Number of Fisher Scoring iterations: 2
## McFadden's R-squared for model is 0.34025566703557